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Abstract 

Quantitative portfolio allocation requires the accurate and tractable estimation 
of covariances between a large number of assets, whose histories can greatly vary in 
length. Such data are said to follow a monotone missingness pattern, under which 
the likelihood has a convenient factorization. Upon further assuming that asset re- 
turns are multivariate normally distributed, with histories at least as long as the total 
asset count, maximum likelihood (ML) estimates are easily obtained by performing 
repeated ordinary least squares (OLS) regressions, one for each asset. Things get more 
interesting when there are more assets than historical returns. OLS becomes unstable 
due to rank-deficient design matrices, which is called a "big p small n" problem. We 
explore remedies that involve making a change of basis, as in principal components or 
partial least squares regression, or by applying shrinkage methods like ridge regression 
or the lasso. This enables the estimation of covariances between large sets of assets 
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with histories of essentially arbitrary length, and offers improvements in accuracy and 
interpretation. We further extend the method by showing how external factors can 
be incorporated. This allows for the adaptive use of factors without the restrictive 
assumptions common in factor models. Our methods are demonstrated on randomly 
generated data, and then benchmarked by the performance of balanced portfolios using 
real historical financial returns. An accompanying R package called monomvn, contain- 
ing code implementing the estimators described herein, has been made freely available 
on CRAN. 

Key words: financial time series, monotone missing data, maximum likelihood, ridge 
regression, principal component regression, partial least squares, lasso, factor models 



1 Introduction 



Missingness in data, and hence the quest if one should eliminate a part of the data or try 
and estimate characteristics of it, is common in statistical analysis. The missing observation 
problem varies in style, depending on th e type of data. One exa mple is random missingness, 



which may stem from erroneous data (IDempster et al. 



19771 ). In financial returns data 



analysis, however, one problem stands out, which we will refer to as monotone missingness. 
This happens when the assets of interest have different lengths of historical financial data, 
e.g., stock prices and returns. There are several possible ways of dealing with this type of 
incomplete dataset. One way is by utilizing the portion of data available across all of the 
assets. Another approa ch involves estimating the missing portion, called imputation (e.g., 



Little and Rubin 



20021 ). A third approach is the focus of this paper. 
Aside from some glitches in data, which will typically give rise to unrealistic spikes or 
random missingness in data, the monotone style of missingness that permeates financial 



historical returns data can be grouped into two patterns. The first is where the histories 
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of assets differ due to the fact that they have started being publicly traded at different 
times. The second is where assets close for various reasons, including corporate actions such 
as M&A (Merger and Acquisition) activities, or liquidation due to bankruptcy. Both are 
critical problems to address when conducting a multivariate analysis. In this paper, we shall 
focus mainly on the former. This is sensible for the application to portfolio balancing that 
we have in mind, since one is naturally restricted to purchasing shares of companies which 
have survived up to current point in time. The latter type of missingness, in absence of the 
former, can be handled similarly, but it is not immediately clear how this would be useful for 
portfolio balancing. Handling both types of monotone missingness jointly, and other types o. 



approximately monotone mis singness, requires the method of data augmentation (IS chafer 



1997 



Little and Rubin 



20021 ). This could potentially be useful for a descriptive analysis, 
but is beyond the scope of this paper. 

Data with arbitrary missingness patterns typically require specialized iterative (even 
stochastic) estimation algorithms that can be slow and cumbersome to implement. However, 
data which follow a monotone missingness pattern lead to a likelihood which has a convenient 
factorization. If we further assume that asset returns are multivariate normally distributed 
(MVN), with histories at least as long as the total asset count, then maximum likelihood 
(ML) estimators are easily obtained by performing repeated ordinary least squares (OLS) 



regressions, one fo r each asset. In the finance lit erature, t 



Stambaughl (119971 ) , but it was first described by 



l is app roach is usually attributed to 



Andersen 



(119571 ) and has since been discussed 



in many texts (see Section [2]). The method fails when there are more assets than historical 
returns. In this case the OLS regressions become unstable due to rank-deficient design 
matrices. This is sometimes called the "big p small n" problem. It has recently received 
much attention in the statistics community, with ready applications in bioinformatics and 
genomics, for example. In the context of estimation for data with a monotone missingness 
pattern, it can severely limit applicability to cases with a small to modest level of missingness. 
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In financial applications, where there may be more assets than there are historical price 
observations for (some of) the assets, this essentially means that the method cannot be 
applied on the full set of assets of interest. This paper explores remedies to this problem. 
We aim to develop a method that can be applied in settings where some assets have histories 
which are shorter than the total number of assets, and even when there are more assets than 
observations. In short, our solution involves replacing OLS with "parsimonious regressions" 
that either make a change of basis, as in principal components or partial least squares 
regression, or apply shrinkage, like ridge regression or the lasso. This enables the estimation 
of covariances between large sets of assets with histories of essentially arbitrary (and uneven) 
length. Even in situations where OLS would have been sufficient, we find that the more 
parsimonious approach can offer improvements in accuracy and interpretation. 

The parsimonious approach also motivates novel ways of exploiting factor information , 



e.g., the value-weighted market index, size, and book-to-market factors (IFama and French 



19931 1 . Traditionally, factor models require the restrictive assumption that assets are inde- 
pendent given the factors. This underlying assumption can be thought of as a specific type of 
parsimony. We show how one can use the data to decide which independence constraints are 
reasonable, by incorporating the factors into our proposed framework, and furthermore how 
this may be accomplished even under condition of monotone missingness in the historical 
returns and factors. 

The remainder of the paper is organized as follows. Section [2] defines the monotone pat- 
tern for missing data, derives the corresponding factorized likelihood, and gives an algorithm 
of repeated regressions to analytically find a ML estimator for the case where the sampling 
distribution is assumed to be MVN. Section [3] outlines methods for dealing with the "big p 
small n" problem in the context of regression with transformed inputs and shrinkage esti- 
mators. We highlight the benefits of increased applicability, accuracy, and interpretability 
obtained with these methods. Section 0] gives the details of an algorithm — for MVN data 
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under a monotone missingness pattern — that combines the method in Section [2] with the par- 
simonious regressions in Section [3j We explain how the method can easily integrate factor 
information, generating a model that essentially mixes factor models with estimators that 
account for the direct dependency between returns. We then briefly describe an implemen- 
tation which has been made freely available as an R package called monomvn. Section [5] shows 
the method in action on synthetic data and real financial data with large numbers of assets 
having histories of highly varying length. Our results are benchmarked against several stan- 
dard comparators in the context of covariance estimation and portfolio balancing, and are 
accompanied by comments on interpretation, efficiency, and on the (benign) consequences 
of using a method that leverages an MVN assumption when that assumption not believed 
to hold. Finally, we conclude with a discussion in Section [6] that focuses on some of the 
limitations inherent in taking a maximum likelihood approach. 



2 Multivariate normal monotone missing data 



Let Y be a n x m matrix of random observations Yi j which may not be completely observed. 
Denote j/y = NA if the i th sample of the j th covariate is missing. In other words, if the columns 
of a sampled Y: y.^, . . . , y^ m , represent a historical return series of assets indexed by j and 
a return for asset j is not available at t ime i, then yj ~ = NA. Observed Y are said to follow 



a mo notone missingness pattern [e.g., (IS chafer 



1997 



2002 



Section 6.5.1) or (ILittle and Rubin 



Section 7.4)] if the columns can be arranged so that ^ NA whenever yij+i ^ NA. 
Figure [1] illustrates this property diagrammatically. The row dimension n, of Y, is equal to 
the number of completely observed samples n\ of yi = y. t i, the maximally observed column. 
Similarly, let = yi :n -j collect the complete data in the j th column of Y, so that rij > n J+1 . 
The monotone missingness patterns considered in this paper are assumed to be missing 



completely at random (MCAR) in that the pattern of missingness neither depends on the 
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Figure 1: Diagram of a monotone missingness pattern with m = 6 covariates, with a maxi- 
mum of n completely observed samples in yi = y. -y. 

observed nor unobserved responses. Note that there may be columns with identical missing- 
ness patterns. In the case of asset return series with observed histories going back different 
amounts of time, the MCAR assumption may be tenuous, but it is commonly asserted any- 
way (e.g., 



Stambaugh 



19971 ). In our notation, the time index (t) for an asset's return history 



would run counter to i, the index of the rows of Y; i.e, t = n — i + 1, as also illustrated in 
Figure HJ 

When the missing data pattern is monotone, the likelihood f{Y\6) can generally be 
factorized by exploiting an auxiliary parameterization = . . . , <fi m )' 

f( Y \ e ) = /(yi|0i)/(y2|yi, ^/(yalyi, y2, <f> 2 ) ■ • • f(y m \yi, • • • , y m -i, 4> m )- 

together with a mapping cp = $(0). With the appropriate conditioning, the yij are assumed 
to be independent and identically distributed (i.i.d.), so that 



f(yj\yu ■ ■ ■ yj-i t <t>j) = IJ f(vij\vi,i Vij-i, 4>j 



t=l 
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We are concerned with the case where the (y^i, . . • 2/;, m ) follow a multivariate normal dis- 
tribution (MVN) so that the likelihood in ([1]) also follows a MVN with constant variance 



and a mean linear in y^i, . . . , 



for financial returns data (e. 



ing assumptions (jStambaugh 



Mi 



1997 



Is. 



he i.i. d. and MVN assumptions may be less than ideal 



19271). but we note that these are common simplify- 



Chan et al. 



1999 



Jagannathan and Ma 



20031 ) because 



they lead to tractable inference and compare favorably (see Section for results and further 
discussion). Maximum likelihood estimators (MLEs) of 0j = (fij, Si^), j — 2, . . . ,m, can 
then be obtained by regression on the complete data: 



(2) 



where 0J = (0o,j,(3i,j, ■ ■ ■ an d X? = Y^_^ is the rij x j design matrix 



1 ••• 2/i,0-i) 



1 3 — 1 0: 



(i-i) 



1 2/2, 



2/2,0-1) 



\i 2/^,1 •■• y nj ,u-i)J 

containing an intercept column, and the first rij observations of the first j — 1 columns of Y. 
So the auxiliary parameters used in (j2J) are </> • = (/3 -, cr|). Figure [2] diagrams the design ma- 
trix (without the intercept term) and response vector involved in one such regression. When 
rank(Yj) = j, and particularly when rij > j, MLEs </> • are obtainable via the straightforward 
calculation: 

rij 

h = ( Y 7 Y ^) ' Y ;>v and ^ = -I to - Y A\\ 2 = - X>„- - (y7)i:n, ^) 2 . (3) 

/to /to 



Then, starting with 6 X comprising of fa = Y%=i 2/;,i/ n i, and s i,i = ZT=i(2/i,i _ fa) l n i 
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Figure 2: Diagram of the design matrix Y 5 (without an intercept term) and the response 
vector y 5 for the fifth regression involved in maximizing the likelihood of MVN data under 
a monotone missingness pattern with m = 6 covariates. 

each 0j can b e estimated condit ional on ^.y-i) = (Ai!(j-i)i^i:(j-i),i:(j-i)) an< ^ estimates of 

1997L 



(3 ■ and as (jStambaugh 



T 



Aj = A)J + /^l: 



Al:( 7 -1) and 



Cj-iJj^Cj-i) 



/ fl T S3 ^ 

A , l:(j-l)j 2 'l:(i-l).l:(i-l) 



(4) 



thus implicitly describing the mapping $ 1 back to Oj space. Observe that we do not use a 
bias-corrected estimator for <t| in ([3]), i.e., with n, - — 7 in stead of rij in the denominator, to 



ensure that ML estimates 6 are obtained (IS chafer 



19971 . pp. 224). However, we have found 



it to be beneficial in practice to use rij — 1 in the denominator as is typical in obtaining 
unbiased estimates of covariance matrices in the complete data case. 

When several columns ye, say I = ji, . . . ,j 2 , have equal lengths of observed histories ri£, 
it is typical to use a multivariate regression (y^ • ■ ■ y J2 ) = + e Ji: j 2 to find (3j V j 2 and 

the empirical variance-covariance matrix 'Vj 1 -.j 2 ,j 1 -.j 2 - Then, several 0j V j 2 can be found at once 

),1;(31-1) and 



by replacing (3j with (3j i: j 2 and er| with V j V .j 2; j i: j 2 in (jlj). Importantly, if Si.^ 
Xjiy'aJiy'a are positive definite, then Xiy 2 ,iy a will be positive definite as well (jStambaugh 
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19971 ). 

Calculating such MLEs requires having rij > j for all j = 1, . . . , m. That is, there cannot 
be an asset whose history is shorter than the number of assets whose histories have greater 
length. If such were the case, then Yj would not be of full rank, and YjYj could not 
be inverted in Eq. ([3]). This is sometimes referred to in the literature as the problem of 
regression with "big p [number of parameters] small n [number of observations]". Numerical 
singularities may arise whenever rij is less than, but nearly equal to, j — especially when n 
and m are large. In the following section we illustrate how these difficulties may be overcome 
by methods of subset selection, coefficient shrinkage, or the use of principal components. 

3 Parsimonious regression 

In this section, we extract and focus on the subproblem of the linear regression in (j2J), in 
terms of a design matrix of p predictor variables with an intercept term (X = Yj) observed 
for n cases, with corresponding responses (y = yj, where n = rij): 

y = X/3 + e, {e^^'N^a 2 ). (5) 

Ordinary least squares (OLS) gives a MLE of (3 = (X T X) -1 X T y. Classically, there are two 
main reasons why one may desire a more parsimonious approach to regression than that 
provided by OLS. The first is that OLS tends to lead to high variance estimators. The 
second is a desire for model fits that have high qualitative interpretability, i.e., that describe 
the data adequately but assume no more causes than will account for the effect. Our reasons 
for seeking an alternative are related to the former more so than the latter. But, most 
importantly, we aim to circumvent the problem of having linear dependence in the columns 
of Yj when rij < j. In this case, we are faced with annxp design matrix X with number 
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of columns p greater than the number of observations n, yielding an X T X matrix that is 
singular and cannot be inverted — a so-called "big p small n" (p > n) problem. We may even 
have that p>n, say, when the total number of assets m is far greater than the number of 
returns recorded for the asset with the shortest history. 

Popular solutions to this problem involve methods of variable selection a nd coefficient 



shrin kage. Probably the most straightforward method is subset selection (IHastie et al. 



2001 



Section 3.4.1) which aims to find the model with the "best" size k (i.e., with k E 
{1, . . . , min(p, n — 1)} covariates). "Best" can be defined in a number of ways, but typically 
involves t— tests, or minimizing an estimate of expected prediction error. Searching through 
all possible subsets quickly becomes infeasible for p > 40. Larger p can be handled by greedy 
methods, but these offer fewer guarantees. Such methods include forward stepwise selection 
which starts in the null (intercept only) model and sequentially adds predictors, and back- 
ward stepwise selection which starts at the saturated model (only applicable when p < n) 
and deletes predictors. Hybridizations also exist. 

By discarding some predictors, subset selection methods can yield a model which is 
more interpretable, and may have lower prediction error. But this "discrete" process can 
produce estimators with high variance. Shrinkage methods are a popular alternative. They 
are hailed for being more "continuous", and in some special cases they can have implicit 
behavior similar to methods like forward selection. The following subsection considers the 
shrinkage methods of ridge regression, and those related to the lasso. In Section 13.21 we 
consider another family of methods which are based on derived input directions: principal 
components regression, which has connections to ridge regression, and partial least squares 
regression. These are handy when the predictors are highly correlated. 

The parsimonious regression methods outlined in this section have been chosen for famil- 
iarity, computational tractability, and implementation. In each case R packages are available 
on the Comprehensive R Archive Network (CRAN), 
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http : / /cran . R-pro j ect . org 



(1R Development Core Teaml . 120071 ) 



which provide off-the-shelf implementations that will make for nice subroutines within the 
framework of constructing estimators for MVN data under monotone missingness. It is 
typical to first standardize the inputs (X and y) as the methods outlined below are not 
equivariant under re-scaling. 

3.1 Shrinkage methods: ridge regression, and the lasso 

Ridge regression and the lasso shrink the coefficients of an OLS regression by imposing a 
penalty on their size: 



(3 {q) = argmin I ^ I ^ - (3 - ^ x ijPj ) + X ^2\f 3 j\ 

P \ t=l \ j=l J j=l 



(6) 



with q = 2 for ridge regression, and q = 1 for the lasso. The tuning parameter A con- 
trols the amount of shrinkage. Notice that the intercept ((3q) is left out of the penalty 
term. Solutions to ([6]) can be obtained analytically in the case of ridge regression with 

A (2) 

(3 = (X T X + AI) -1 X T y. Quadratic programming is required for the lasso. Both methods 
have interpretations as Bayesian maximum a posteriori (MAP) estimators after imposing 
particular prior distributions. Other choices of q > are also possible, however the con- 
straint region for < q < 1 is non-convex, which makes solving the optimization problem 
more difficult. 

For ridge regression, the penalty parameter (A) is most advantageously chosen by min- 
i mizing cros s valid ation (CV) estimates of predi c tive e rror. The commonly used HKB 



( jHoerl et al. 



19751 ) and L-W ([Lawless and Wang 



19761 ) methods are computationally ef- 



ficient, but require that p < n to fit an O LS. The implementation of ridge regression used 



in this paper comes from the MASS library (IVenables and Ripley 



2002) for R in the form of 
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a function called lm. ridge. 



Though the form of ridge regression and the lasso are similar, there are several important 

" ( 2 ) 

differences. A large A will cause the ridge estimator /3 to have many coefficients shrunk 
towards zero. The lasso estimator ^ ^ has as similar effect, but, importantly, may contain 
many coefficients which are exactly zero — something which is only possible for < q < 1. In 
the Bayesian interpretation, setting q < 1 corresponds to choosing a prior which concentrates 
more mass on small \/3j\, with the most on /3j = 0. In this way, the lasso implements a kind 
of continuous subset selection. As A is increased, the \/3j\ decrease, eventually increasing 
the number of them which are identically zero, though this relationship need not be strictly 
monotonic. 

The implementation of lasso used in this paper is contained in the lars package for R 



( jHastie and Efron 



20071). 



Efron et al. 



(120041 ) show how the lasso, and two methods called 



stepwise and forward stagewise, are special cases of their method of least angle regression 
(LAR). LARS can calculate all possible lasso estimators with computational effort in the 
same order of magnitude as OLS regression applied to the full set of co variates. CV c an be 
used to select the final model, e.g., using the "one -standard-error" rule (jHastie et al 



2001 



Section 7.10), or a more thrifty C p ( iMallows 



1973T ) method can be used, but only when p < n. 



When applicable, the C p m ethod performs nearly a s well as CV within the MVN setting 



with monotone missingness. 
equally tame benchmarks. 



Madigan and Ridgewayl (120041 ) come to similar conclusions on 



( Ishwaran 



2004 



Stine 



however, C p has also been criticized for p referring large models 



20041 ) and for being slightly at odds with LARS (ILoubes and Massart 



20041 ). Since we are mostly interested in applying LARS methods (i.e., lasso) when OLS is 



not applicable, i.e., when p > n, we shall generally rely on CV to select the final model. 
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3.2 Principal components and partial least squares regression 

In situations where there are a large number of highly correlated inputs, a decomposition 
by principal components (PCs) can be used to select a small number of linear combinations 
of the original inputs to be used in place of X. The related methods of principal compo- 
nent regression (PCR) and partial least squares regression (PLSR) start by performing an 
orthogonal decomposition of X, but differ in how the linear combinations are constructed. 

In PCR, singular value decomposition (SVD) is performed on X, i.e., X = (UD)V T = 
TP T , where U is an n x p matrix of left singular vectors describing the "output basis", 
D is a diagonal matrix containing the corresponding singular values (a square-root of the 
eigenvalues) in non- decreasing order, V is a p x p matrix of right singular vectors describ- 
ing the "input basis", and T and P are the so-called scores and loadings defined by the 
decomposition. Next, y is regressed on the first k PCs, i.e., the scores T(k), where the (k) 
subscript indicates the extraction of the first k columns of T, i.e., the first k columns of U, 
V, and the first k rows/cols of D. Since the columns of T are orthogonal, the solution is 
just a sum of univariate regressions. Importantly, the solution can then be written in terms 
of the coefficients on the predictors in the columns of X, 

(arbitrary scores and loadings) /3(k) = P (fc) (Tj. ) T {A . ) )~ 1 Tj c) y (7) 

(from SVD on X) ^\k) = V^D^U^y, 

a vector of length p. When k = p < n, the coefficients in ([7]) are identical to those obtained by 
OLS. There are many ways of choosing how many components (k) to keep in the final model. 
One way is to consider the relative sizes of the eigenvalues as a proportion of the variation 
explained by each principal component, and then choose k so that 80-90% of the variation 
is explained. A less ad hoc and more reliable — but more computationally intensive — method 
that can be applied even when p > n involves using CV to estimate predictive error in order 



13 



to find k G {1, . . . , min(p, n — 1)}. 

PLSR, by contrast, aims to incorporate information about both X and y in the scores 
and loadings — which in this context are often called latent variables (LVs) — by proceeding 
iteratively. The method is initialized with the SVD of X T y, thereby including information 
about the correlation between, and the variance within, X and y. The scores and loadings 
obtained by PLSR optimally capt ure the covaria nce between X and y, whereas PCR concen- 



trates only on the variance of X (jde Jong 



19931 ). There are several algorithms for obtaining 
the scores and loadings, but once obtained, the regression coefficients /3 P (k) in X-space are 
recovered by following ([7]), and CV can be similarly used to pick k. 

In situations where a minor component of X is highly correlated with y, PLSR may have 
a significant advantage over PCR. Otherwise, the methods have a more or less comparable 
performance record despite a few operational differences — e.g., PLSR usually needs fewer 
LVs, but can also yield higher variance estimators of the regression coefficients. Both have 
behavior simila r to other shrinkage metho ds, particularly ridge regression. For example, it 



can be shown (IFrank and Friedman 



19931 ) that ridge regression shrinks the coefficients of 
principal components by a factor of dj/(dj + A), where the dj are from the diagonal of D, 
whereas PCR truncates them at k. 



An R package called pis (IMevik and Wehrensl. 



tion of PCR and three algorithms for PLSR (iDayal and MacGregor 



Martens and Naes 



a unified implementa- 


. 1997: 


de Jong. 


1993; 



19891 ). together with built-in facilities for estimating k via CV. 



4 The monomvn algorithm 

So long as rij > j for all j = 1 . . . , m, and rij > n J+1 , an algorithm for finding the parameters 
H and S that maximize the MVN likelihood for monotone missing data proceeds as outlined 
in Section El Initialize \i\ and En to the sample mean and variance of the first column yi 
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of Y, then iterate through the following steps for j = 2, . . . , m: 

1. Find the MLEs ([3]) of f3j and cxj in a regression (T5]) of yj onto the first j — 1 columns 
of Y (as predictors), using only the first rij observations; 

2. Obtain the MLEs of fij and from Ai : fj— i): flj an d c*f as in (jlj). 

If any rij < j, then we have a "big p small n" problem, and the standard regression in step 
1 above cannot be performed. In practice, it may be that rij > j and still there are columns 
of the design matrix which are not linearly independent, and so it is not of full rank. The 
chances that this may happen become increasingly more likely as j approaches rij when finite 
(double-precision) computer representations make it so that the design matrix is numerically 
rank deficient. Both issues are addressed simultaneously by instead performing one of the 
parsimonious regressions outlined in Section [3j Then step 2 can proceed as usual. Observe 
that this approach also enables estimation when there are more assets than historical returns 
(m > n). 

4.1 Choosing the parsimonious proportion 

Even when parsimonious regression is not strictly necessary, it can aid in interpretation, and 
possibly even yield more accurate and lower variance estimators. The lasso and the other 
LARS methods can choose to shrink /3 so that only the intercept term is nonzero. This 
enables the detection of zeros in the MVN covariance matrix X. In other words, it can be 
used as a test, of sorts, for independence between assets. 

Towards building a more efficient and interpretable estimator, one may consider applying 
a parsimonious regression for every iteration of step 1 above. This is explored further in 
Section 15731 Alternatively, one could determine a threshold, say p, representing a proportion 
of rows to columns in the design matrix past which a parsimonious regression is applied 
regardless. That is, when rij < pj, for < p < 1. Then, the p = case corresponds to 

15 



always using a parsimonious method, and p = 1 reverts to applying one only when necessary. 
In Section 15.1.21 we show how easy it is to establish reliable rules of thumb for choosing p. 



4.2 Incorporating factors 

A popular estimator for the covariance matrix of financial asset returns involves using factor 
models. The essential idea behind the factor model is to regress the observed returns yj on 
measured common market factors F, and to derive a covariance matrix of the returns as a 
function of the regression equations. 

For a factor space with K factors, the model can be formalized as follows. Each excess 
return y^ is modeled by the regression equation 



K 

Ui,j = -Vj + Afc + e M 



k=l 



where each is a residual term independent of F. The residual terms for the i th instance 
are assumed to follow a zero-mean MVN with diagonal covariance matrix D. For instance 



a com mon one-factor model takes / to be value-weighted market index (e.g., 



Chan et al 



19991 ). A common three-fact or model augments the y alue- weighted market index with size 



and book-to-market factors ( Fama and French 



1993|). 



Factors are assumed, for now, to be i.i.d. and to follow a MVN with K X K covariance 
matrix f2. Let A be the K x m matrix defined by the entries A^j = Xkj, for k — 1, . . . , K. 
It follows that the covariance matrix of the returns, as parameterized by {fi, A, D}, is given 
by 

£(/) = A T fiA + D. (9) 

" (/) 

An estimate S can therefore be obtained by estimating each column \j = (Xij, ■ ■ ■ , Xk,j) t 
of A by regressing j/ ■ on F with an intercept. The mean sum of squares of the residuals of 
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each regression forms the diagonal of D, and the off-diagonal entries are zero. The estimate 
O is the empirical covariance of the factors. Note that each regression equation requires only 
the data observed for the particular return y,, together with the corresponding observations 
for the factor (s). However in practice, the method is applied only to completely observed Y 
and F. 

The main underlying assumption is that returns are mutually independent conditioned 

on the factors. If the number of factors is considerably smaller than the number of returns, 

- (/) 

the model will be parsimonious and the resulting £ ' will have lower variance than the 
empirical covariance matrix. This assumption allows for any missingness pattern, even the 
extreme one where no joint observation of returns and y^ exists. The drawback is that 
the independence assumptions encoded in this model might be unrealistic, and the resulting 
estimate will suffer from a strong bias. 

Instead, we can use the data to find which independence assumptions are adequate by 
integrating the factor model into the monomvn framework. Consider the full regression model, 
where we regress y, on Yj and Fj = F^^ simultaneously: 

y, YA, ■ F A • (io) 

The A j term does not appear because it is not identifiable given the presence of /3 j. Since 
this formulation is in the same family of parameterizations of the original models used in 
monomvn, an analogous procedure applies with minor pre- and post-processing. First shift 
the labels the returns for each asset by K so that y, becomes yj+x and the corresponding /3 ■ 
becomes (3j + x- Then map F& to Y^ and Xk to f3 k . If the recursion in Eq. (j4]) is then applied 
as usual, giving the estimates fi [an (m + K) vector] and £ [an (m + K ) x (m + K ) matrix] , 
an estimate of the covariance matrix of the asset returns can then be extracted from the 
bottom-right m x m block of S, i.e., £ = ^(K+iy.(m+K),(K+i)-.(m+K)- The superscript 
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(f + m) is meant to indicate dependence on both factors and assets. Importantly, no internal 
changes to the workings of the monomvn algorithm are necessary. 

Observe that if the (parsimonious) regression method applied within monomvn uses OLS 
whenever regressing onto the factors, and sets the regression coefficients to zero otherwise, 
then we obtain ^ = In the context of monomvn we call this the "factor-parsimony" 

regression, filling a role similar to PCR, lasso, etc. If required, the covariance matrix of the 
factors can also be recovered as f2 = Also observe that, within the monomvn 

framework, it is possible to handle factors with historical missingness. 

If, instead of the factor-parsimony method, any of the other methods (outlined in Section 
[3]) are used, then shrinkage is applied to both /3 - and Xj in fflUj) . In this case we obtain a 
generalization of the independence structure assumed in the classical factor model, allowing 
the data (factors and returns) to determine the appropriate mix of influence on the resulting 
estimator for S. It is interesting to point out the link between t his generalized fac t or mo del 
(fTUjl resulting in X , and the optimal shrinkage estimator of 



Ledoit and WoHl (120021 ): 



± (e) = a± if) + (1 - a)± (c \ for a e [0,1]. (11) 



Here, is the standard covariance estimate obtained using only the portion of the data 
available across all assets and a is an "optimal" mixing proportion chosen by CV. (Note 

-s (/) 

that Ledoit's factor-based estimator S uses only completely observed joint returns.) The 
spirit of these two approaches is similar, but they are quite distinct. The published success 
of this type of shrinkage approach suggests that it is important to combine a (complete data) 
factor-based estimate with a traditional covariance estimate. Indeed, the estimator 
involves combining covariances mediated by factors with covariances that are not accounted 
for by factors; it can also handle historical missingness via the "factor-parsimony" regressions 
within monomvn. But rather than shrinking a (possibly) non-positive definite estimator 
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towards T,^' with a single parameter a as in ( fTTj) . monomvn applies m + K unique shrinkage 
parameters, one for each regression, while taking full advantage of all available returns. 



4.3 Software 



Finally, an R package called monomvn (IGramacy , 



20071 ) has been made freely available through 



CRAN. It implements the algorithm described in this section, and supports all of the par- 
simonious regression methods outlined in Section [3] via the stand-alone packages outlined 
therein. Two forms of CV are supported for choosing the number of components in the 
parsimonious regression: random 10-fold and (deterministic) leave-one-out (LOO). A p 
argument facilitates parsimonious regression modeling, as described above. Incorporating 
factors is as straightforward as bundling them in as if they were returns, as described above. 



5 Empirical results 

In this section, the monomvn methods are illustrated and validated on real and synthetic 
data. In Section 15.11 we focus on the properties of estimates of fi and S in a controlled 
setting involving synthetic data under monotone missingness. In 15.21 we turn to applying the 
estimators towards balancing portfolios in a mean-variance setting. We wrap up in 15.31 by 
using monomvn in a descriptive analysis of dependence involving thousands of assets. 

5.1 Properties of the estimators on synthetic data 

Here, we use a data-generation mechanism provided by the monomvn package: randmvn gen- 
erates random samples from a randomly generated MVN distribution with an i.i.d. standard 
normal mean vector /j,, and an Inv-Wishart sampled X; rmono imposes a uniformly dis- 
tributed monotone missingness pattern. A similar method is used to generate samples with 
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monotone missingness from a multivariate t distribution (MVt) as well, in order to demon- 
strate that the MVN-based monomvn methods still perform well in the presence of heavier 
tailed data. 

The comparisons to follow focus on highlighting the relative strengths and weaknesses of 
variations of monomvn as a function of the choice of parsimonious regression method applied. 
Additionally, two simpler methods are devised as calibration tools, and to illustrate the 
advantage of the monomvn approach over those which do not leverage the structure of the 
monotone missingness pattern. The simplest comparator is called "complete" , where \l and 
X are estimated using only the portion of data available across all assets, i.e., only the 
completely observed returns. Put yet another way: only the first n m rows of Y are used. 
Another comparator is "observed" which uses all of the available data in an obvious but 
naive way: 



1 \ 



and 



k=l 



m 

1 . . 



forz = l,...,j. (12) 



k=l 



Unfortunately, the covariance matrices provide d by the "complete " and "observed" estima- 



tors are not guaranteed to be positive-definite (IStambaugh 



19971). 



As a final comparator, we consider a me t hod o f estimation for incomplete data for arbi 



trary missingness patterns ([Dempster et al 



mization (ECM) algorithm (IMeng and Rubin 



.9771). using the expectation conditional maxi- 



19931 ). Consequently, this method also works 



when the missingness pattern is monotone, but represents a sort of overkill in this case. 
Two similar software packages are available for this method when the data is assume d to 



20021 ) for 



follow a multivariate normal distribution: the norm package (INovo and Schaferl . 
R, and ecmnmle (contained in the Matlab Financial Toolbox). We prefer norm because 
its core is implemented in compiled Fortran, with an R wrapper. It gives nearly identical 



results to — but runs more than 20 times faster than — ecmnmle which is written solely in 
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Matlab. The ECM method iterates until convergence, stopping at a local maximum when 
an improvement threshold is met. As a result, its computational demands and the ultimate 
optimality of the resulting estimator are sensitive to the initial configuration of the algo- 
rithm. Though the missingness pattern may be arbitrary, it is well-known that the method 
can fail due to convergence issues and/or numerical singularities that can arise due to finite 
machine representations when more than 15% of the data is missing (see, e.g., the ecmnmle 
documentation within Matlab). So it cannot handle m > n, which precludes it from general 
use in our problem. 

The expected log likelihood (ELL), which is related to the Kullback-Leibler (KL) diver- 
gence, is used as the main metric for comparisons. For probability distribution functions 
(PDFs) p and q, the KL divergence between p and q is defined as 

/v( x ) 
p(x) log — - dx. 
q{x) 

In the particular case where q is the estimated MVN with parameters fi and £ and p is the 
"true" parameterization with /j, and S, the KL divergence can be shown to be: 

A<l(MVN(£, £) II MVN(^, £)) = i (log j|j + tr(if 's) + (£ - fi) T ±~\ii - /x) 

The ELL of q relative to data sampled from p is given by 

Epjlogg} = J p(x) \ogq(x) dx 

= J p{x)\ogp(x) dx - D Kh (q\\p). (13) 

The integral J plogp in ( ITBl) is the entropy of p. The entropy of MVN(/x, X) can be shown 
to work out to — \ log{(27re) Ar |S|}. When analytical expressions are not available it is easy 
to approximate ([TBI numerically by T' 1 Y^f = i^ogq(x t ), where x t ~ p is simulated out of 
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sample. This nicely converges to the truth for large T. The ELL is good for ranking 
competing estimators, however actual "distances" between estimators is hard to interpret. 



5.1.1 Comparing estimators 

Figure [3] (left) summarizes a comparison between the different parsimonious regressions 
within the monomvn algorithm, using randomly generated MVN data with m = 100 and 
n = 1000, repeated over 100 trials, each time sampling new /it, S and Y ~ MVN(/x, I]) with 
uniform monotone missingness. Parsimonious regressions were used only when necessary 

ELL ranks on MVN data ELL ranks on MVt data 

CO - 

in - 



CM - 

plsr per ridge lasso lar fwdstag step obs complete plsr per ridge lasso lar fwdstag step obs complete 

method method 

Figure 3: Comparison of parsimonious regression (p = 1) methods (using 10-fold CV) 
on randomly generated MVN data (n = 1000 samples, m = 100 dimensions) data with 
H ~ iV m (0, 1), S ~ Inv-Wishart and uniform monotone missingness: boxplots of ELL ranks 
summarizing 100 repeated trials. 

(i.e., p = 1). 10-fold CV was used to choose A or the number of (principal) components. As 
can be seen from the table, PCR emerges as the clear winner in this comparison, nearly al- 
ways having the best ELL rank. The complete and observed comparators are almost always 
ranked worst. 

In anticipation of the application in Section 15.21 to financial returns data, which are 
believed to follow a heavier tailed distribution than MVN, we repeated the above experiment 
with synthetically generated MVt data with a monotone missingness pattern. The degrees 
of freedom parameter was sampled as v ~ Exp(|) + 1. Figure [3] (right) shows roughly similar 
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behavior for the MVN based monomvn estimators when fit to MVt data: PCR is the best 
and the observed and complete estimators are the worst (although the order is switched). 
ELL was computed numerically using the known degrees of freedom parameter (s), is, which 
generated the data. This is a legitimate choice since the is is not used in the mean-variance 
analysis to follow in Section 15.21 It is interesting to note the improved rank(s) of the ridge 
regression based estimator in this case. 

These results are in line with those of previous simulation studies which compare ML 
estimators — that are able to leverage all of the available data by exploiting the MVN 
assumption — to those which use more reasonable distributional assumpt ions but whi ch, for 
reasons of tractability, can only use the completely observed cases (e.g., iLittlei . Il988l ). The 
evidence suggests that making use of all of the available data in a sensible way is the crucial 
ingredient despite that the underlying assumptions may be viol ated. The dominanc e of PCR 



20071 ) showing 



in both MVN and MVt scenarios is in line with a recent study (iDe Mol et al. . 
that PCR out-competes other shrinkage (Bayesian motivated) estimators in applications 
with a large number of financial asset returns. 

5.1.2 Choosing the parsimonious proportion 

Recall from Section H] that p G [0, 1] determines when a parsimonious method is to be used 
instead of OLS in the monomvn algorithm. The experiment performed here is similar to the 
previous one, except that n and m are varied stochastically with m uniform in {5, . . . , 100} 
and n\m uniform in {max(10, \m/2\), . . . , md}. Table [I] shows the mean and 90% interval 
for the optimal p over 100 repeated trials sampling new m, n, etc., each time. LOO CV was 
used to choose A, or the number of (principal) components, and the objective criteria used 
was ELL. The final column in the table shows the proportion of time when p = 0.25 was 
better than p — 0. Observe that all methods except ridge regression work well, as a rule of 
thumb, with p = 0.25. All things being equal, a larger p setting may be preferred for speed 
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method 



optimal p 
5% mean 95% 



improv 



plsr 

per 

ridge 

lasso 

lar 



0.12 0.23 0.37 

0.09 0.27 0.51 

0.04 0.25 0.67 

0.12 0.24 0.38 

0.11 0.26 0.41 

0.15 0.26 0.39 



0.55 
0.69 
0.29 
0.76 
0.65 
0.74 



stepwise 



Table 1: Mean and 90% interval for optimal p, the ratio of columns to rows in the design 
matrix before switching from OLS to a parsimonious regression. The improv column gives 
the proportion of runs for which p = 0.25 is better than p = 0. We repeated this over 100 
trials with LOO CV with the ELL as an objective. 

reasons. 

5.1.3 Comparing to ECM 

Due to the limitations of ECM-based methods, like those implemented by norm and ecmnmle, 
a comparison of monomvn to these approaches requires a more controlled experiment. Fixing 
m = 10 and n = 100, 1000 repeated experiments similar to the ones described above, with 
uniform monotone missingness, gave that monomvn (with PCR) had higher ELL 997 times 
(100%) and that ECM failed to converge 53 times (~ 5%). As n grows relative to m, the 
performance of the methods converge. For example, with m — 10 and n = 1000 the means 
are monomvn is better 831 times (83%), and ECM failed to converge 11 times (1%). As the 
dimensionality (m) increases modestly compared to the sample size (n), the ECM-based 
norm algorithm consistently diverges. For example, with m = 20 and n = 100 norm fails to 
converge more than 40% of the time. 

5.2 Constructing portfolios from historical returns 

In this section we examine the characteristics of minimum variance portfolios constructed 
using estimates of S based on historical monthly returns. The experimental setup is similar 
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to ones that have been used in se veral recent papers on covariance estimation, an d minimum 



variance portfolio balancing (e.g. 



Chan et al. 



1999 



Jagannathan and Ma 



20031 ). Following 



these works we use the monthly returns of common domestic stocks traded on the NYSE 
and the AMEX from April 1968 until 1998. We require that the stocks have a share price 
greater than $5 and a market capitalization greater than 20% based on the size distribution of 
NYSE firms. Estimators of I] are constructed based on (at most) the most recently available 
60 months of historical returns. This is in keeping with previous work and acknowledges 
that the i.i.d. assumption in Eq. (CQ) is only valid locally (in time) due to the conditional 
heteroskedastic nature of financial returns. Short selling is not allowed; all portfolio weights 



must be nonnegative. Although it is typic al to cap t 



order to "tame occasional bold forecasts" 



poor estimators (IJagannathan and Ma 



r e wei ghts as well, e.g., at 2%, in 



Chan et al 



19991 ) that typically arise due to 



20031 ). we specifically do not do so here. Our goal is 
fully expose the quality of the estimators and to illustrate that with good estimators such 
rules of thumb are unnecessary. 

Four classes of estimators of S are used in the comparisons which follow. (1) The 
complete estimator outlined earlier, with variations depending on how many assets have 
historical returns with certain lengths (more below). (2) A one-factor model using the 
return on the value-weighted portfolio of stocks traded on the NYSE, AMEX, and Nasdaq. 

(3) The monomvn method using the parsimonious regressions of Section [3] with p = 0.25. 

(4) The monomvn method incorporating the value-weighted portfolio as a factor with, as 



described in Section 14.21 and with p = 0. For this class we augment the collection of 
parsimonious regressions to include the "factor-parsimony" method. We do not compare 
to the ECM methods of norm or ecmnmle here, as this has proved to be both cumbersome 
and troublesome; the methods seem unable to handle the missingness level in this data. For 
example, norm consistently fails to converge even after thousands of very slow iterations of 
ECM (each taking several seconds on a 3.2 GHz Xeon). 
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To assess the quality and characteristics of the constructed portfolios we follow 



Chan et al. 



( 119991 ) in using the following: (annualized) return and standard deviation; (annualized) 
Sharpe ratio (average return in excess of the Treasury bill rate divided by the standard de- 
viation); (annualized) tracking error (standard deviation of the portfolio return in excess of 



the S&P500 return); correlation to the market (S&P500 return); averag e number o: 



with weights above 0.5%. We c losely follow the experimental setup of 



Chan et al. 



stocks 



(11999 ) 



and iJagannathan and Ma! (j2003l ) by randomly subsampling from the qualifying stocks in 
each year, and holding the portfolios for the entire subsequent 12 months. The random 
subsample reduces the size of the estimation problem, and thus computational burden, so 
that many methods can be simultaneously benchmarked against one another. It can also 
serve the dual purpose of enabling the calculation of nonparametric (bootstrap-like) Monte 
Carlo assessments of variability, which was not a feature explored in previous work. 

Specifically, in each April, starting in 1972, we randomly subsample 250 stocks (without 
replacement) from those which qualify (in the sense outlined above) and which have at least 
12 months of historical returns. In this way our work differs slightly from our predecessors 
whose estimators require exactly 60 months of historical returns. We chose 12 months 
in order to highlight the benefit of incorporating assets in the portfolio with fewer than 
60 months of returns via monomvn. Estimates of the covariance matrix of monthly excess 
returns (over the monthly Treasury Bill rate) are generated form the different models using 
at most the last 60 months of historical returns for the 250 assets. Based on the estimate(s), 
quadratic programming is used to find the global minimum variance portfolio(s) described 
by weights w = argmin w w T Sw. Then, the weights w are applied to form buy-and-hold 
portfolio returns until the next April, when the randomization, estimation, and optimization 
steps are repeated and the portfolios are reformed. 

Table [2] summarizes the properties of those returns averaged over 50 repeated random 
paths through the 26 years in the study. The table is broken into five sections, vertically, 
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method 


mean 


sd 


sharp e 


te 


cm 


wmin 


eq 


0.149 


0.188 


0.432 


0.062 


0.949 





vw 


0.135 


0.162 


0.412 


0.032 


0.981 


45 


min 


0.147 


0.183 


0.431 


0.105 


0.819 


29 


com 


0.150 


0.183 


0.447 


0.107 


0.810 


26 


rm 


0.132 


0.129 


0.494 


0.094 


0.803 


16 


fmin 


0.142 


0.146 


0.503 


0.086 


0.845 


38 


fcom 


0.144 


0.146 


0.521 


0.087 


0.841 


37 


frm 


0.138 


0.130 


0.537 


0.117 


0.688 


21 


plsr 


0.148 


0.154 


0.516 


0.124 


0.686 


15 


per 


0.143 


0.132 


0.563 


0.109 


0.732 


23 


ridge 


0.158 


0.165 


0.546 


0.122 


0.716 


16 


lasso 


0.151 


0.150 


0.550 


0.054 


0.941 


69 


lar 


0.151 


0.151 


0.545 


0.053 


0.944 


71 


step 


0.152 


0.155 


0.541 


0.052 


0.946 


75 


ffp 


0.143 


0.132 


0.566 


0.113 


0.712 


24 


fplsr 


0.147 


0.153 


0.514 


0.123 


0.688 


15 


fpcr 


0.142 


0.131 


0.560 


0.109 


0.732 


21 


fridge 


0.158 


0.163 


0.554 


0.119 


0.726 


19 


flasso 


0.152 


0.148 


0.561 


0.056 


0.936 


69 


flar 


0.151 


0.151 


0.546 


0.053 


0.943 


70 


fstep 


0.154 


0.153 


0.558 


0.055 


0.939 


73 



Table 2: Comparing statistics summarizing the returns of yearly buy-and-hold portfolios 
generated over 50 repeated random paths through the 26 years of monthly historical returns. 
The first group of rows show the equal- and value-weighted portfolios; the second group 
of rows have complete data estimators based on the preceding 12-months of returns, the 
maximal completely observed historical returns, and the returns for the subset of assets with 
60 months of historical returns; the third group uses the same returns as the second with a 
one-factor model; the penultimate group uses monomvn; the final group uses monomvn with 
the additional one-factor. The statistics across the columns are (annualized) mean return, 
standard deviation, Sharpe ratio, tracking error, correlation to market and average number 
of stocks with weights above 0.5%. 

starting with the equal- and value-weighted portfolios (for comparison), followed by global 
minimum variance portfolios based on estimated S: complete data estimators, complete 
data estimators based on a one-factor model, monomvn estimators, and monomvn estimators 
incorporating the one-factor. Throughout, the "f" prefix indicates that the estimator uses 
the value-weighted factor in some way. The "min" and "fmin" estimators use only the 
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le case of "frm". All in all, these 



Chan et al. 



1999l ) showing that, 



last 12-months of historical returns, whereas the "com" and "fcom" estimators use the 
maximal complete history available. The "rm" and "frm" estimators focus only on those 
assets with completely observed returns for the last 60 months — where the weights for the 
other assets are set to zero (removing them from the portfolio). The annualized mean, 
standard deviation, and Sharpe ratio statistics for these six estimators lead one to conclude 
that the more historical returns (within the five-year window) that can be used to estimate 
S the better. Tracking error is also improved, except in t 
results support those obtained in previous studies (e.g., 
in particular, factor models improve upon the naive estimator in the complete data case. 
Further inspection of this part of the table reveals that the improved Sharpe ratios for "rm" 
and "frm" are due to the smaller standard deviation obtained under these estimators, but 
that this comes at the expense of a smaller mean return. This may be due to more weight 
being placed on fewer assets (as indicated in the "wmin" column). Both "rm" and "frm" 
also have the lowest correlation to the market in their cohort. 

The final two groups of rows tell a similar story. The Sharpe ratios for the monomvn 
estimators — with and without the value-weighted factor — show marked improvements over 
the complete data estimators. As before, the inclusion of the value-weighted factor further 
adds to the improvement, e.g., yielding higher Sharpe ratios except in the case of PCR 
where they remain essentially unchanged. The "ffp" estimator, i.e., the one-factor model 
applied via monomvn using the "factor-parsimony" regression method, has the lowest stan- 
dard deviation, and therefore a comparatively high Sharpe ratio despite a low mean return. 
We can see that, as with "rm" and "frm", this low standard deviation is obtained by plac- 
ing large weight on only a few assets. PCR, PLSR, and ridge regression — both with and 
without factors — show similar properties. In contrast, the LARS estimators (lasso, lar, and 
stepwise — both with and without the factor), obtained similar or better Sharpe ratios but 
with a large mean return, by assigning large weight to roughly three times more assets. As 
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a result, these LARS estimators obtain a much lower tracking error and higher correlation 
to the market. 

So when appropriate factors are available it makes sense to use them, and the best way to 
do so is via monomvn. It would seem that the one-factor LARS based monomvn estimators give 
the best results in the study, overall, with lasso in the top spot. It is reassuring to notice that, 
when an appropriate factor is not available, the LARS based monomvn methods, and PCR, 
give largely similar results by incorporating all of the available returns in a parsimonious 
way. This is not true in the case of the complete data estimators. 

Figure H] compliments Table [2] by showing the distribution (via boxplots) of the Sharpe 
ratios and the tracking error obtained for each of the 50 random paths through the 26 years. 
Recall that these were obtained by randomly sampling 250 qualifying assets in each year. 
The numbers in Table [2] are the means of data use to construct each boxplot, whereas the 
boxplots in the figure represent Monte Carlo approximations to the sampling distribution 
of portfolio characteristics under the various estimators of S. In short, the figure reinforces 
the superiority of the LARS estimators which, in addition to having large Sharpe ratios and 
small tracking error, also exhibit small variability with respect to Monte Carlo resampling. 
It is interesting to note that the LARS based estimators (without the factor) show the lowest 
variability in their Sharpe ratios amongst all monomvn estimators. 

It may be tempting to conclude that these results contradict the results of the ELL-based 
comparison(s) on synthetic data in Section 15.11 Indeed, in that section we saw that PCR 
seemed to be the best at recovering the (known) of the distribution which generated the 
training data. However, means, variances, Sharpe ratios, tracking error, etc., are specific 
statistics, and moreover they are obtained after a (highly non-linear) transformation into 
portfolio weights via quadratic programming. Therefore, we should expect to see different 
results, since these statistics represent utilities which are different from ELL. That being 
said, notice that PCR is still the best in terms of average annualized standard deviation 



29 









O 


o 


CO 

o 


O T 


o 

-r _l_ 
I ~T 1 

1 -•- _i_ 


o 

T- "J" 
1 1 


o 

T 

"T" -r J 

i nn t t t 

O 


o 

-r "T 

j _i_ 

_!_ q 


1 1 1 1 1 1 

.2 0.3 0.4 0.5 0.6 0.7 
Sharpe ratio 


1 1 
cd g 


i i i 

.E E E 
E o 


fmin - 
fcom - 
frm - 


plsr - 
per - 
ridge - 
lasso 

lar - 
step - 


f P - 

fplsr - 
fpcr - 
fridge 
flasso - 
flar - 
fstep - 


O 







-r 


o 
o 

* 6 

O 


e 

o 

-I- 
o 


1 1 1 
3 0.12 

ig error 




o 








_ q s 
o o 

CO 


T 










_ p 
o 



"i i i i i i i i i i i i i i i i i i i i r 



cr g c 

CD > C 



E E .E E E 
o >- p o j= 



co o 



CD O 



CL CL 
CD 3= 



cn o 
Q. S- 



CD O 

D) CO 

T3 « 

"C CO 



Figure 4: Boxplots of Sharpe ratios ^opj and the tracking error (bottom) obtained over 50 
random paths through the 26 years, obtained by randomly sampling 250 qualifying assets 
in each year. The averages of these numbers is what is reported in Table [2j The horizontal 
bars correspond to the vertical ones in that table. 



(and thus Sharpe ratio) [see Table [2] when no appropriate factors are available — but with 
high variability [see Figure Hj. Importantly, both experiments (here and in Section loTTi) show, 
resoundingly, that using all of the available data via monomvn is preferred over a complete 
data estimator. 
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5.3 Examining dependence relationships between assets 

For our final empirical analysis we shall demonstrate the descriptive power of monomvn. At 
the same time we shall take the opportunity to show how the method can be applied when 
there are thousands of assets. 

From Thomson Financial's Datastream (www.datastream.com), we have downloaded, in 
dollar terms, the total returns data of each stock in the Russell 3000® Index representing 
the broad United States equity universe encompassing approximately 98% of the market: 
1792 weekly returns between 12/01/1973 and 11/05/2007 for 2894 assets. In order to ob- 
tain a set of clean and complete data, each series is tested for illiquidity, completeness, and 
stationarity, using the following methodology. We removed assets which were marked to 
market at a frequency other than weekly, to exclude illiquid assets that may exhibit arti- 
ficial serial correlation (this essentially excludes any stock that has more than two weeks 
of co nsecutive unchanging pri ces at any point in time). Then, an augmented Dickey Fuller 



test (IDickey and Fuller 



19791 ) is employed to exclude any of the assets that exhibit non- 
stationarity (six lags have been tested at the 99% confidence level). A total of 2461 stocks 
remained after applying these two filtering steps. There are 558 assets with longest history 
of 1792 returns; the least observed asset has only 76 returns (so the "complete" estimator(s) 
can use only 3% of the data); the overall proportion of missing observations was 0.472. 

We consider applying the lasso version of the monomvn algorithm to this data, with p = 0, 
i.e., always use the lasso (never use OLS). As we have mentioned, the lasso (and other LARS 
methods) have descriptive (as well as predictive) power because they can provide (3 with 
many coefficients set to zero. In the context of the monomvn algorithm this means that the 
MLE S may have zero entries, indicating marginally uncorrelated assets, and moreover may 
have block-diagonal structure (or zeros in £ ) indicating a pairwise conditional indepen- 
dence of assets. Since ridge regression, PCR, and PLSR always yield > 0, they would 
never produce a zero in S or S , and so would be less useful for creating such qualitative 
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Figure 5: Histograms of the number of zeros in each column of S (left) and X (right). 

summaries of the relationships between asset returns. It may be tempting to interject zeros 
where there are small values in S or X , but like the "complete" and "observed" estimators, 
the resulting matrix would not usually be positive definite. Moreover, classical pairwise tests 
for independence, say via the Pearson product-moment correlation coefficient, would give 
unrealistic results. With return histories as short as ~ 80 weeks and estimated correlation 
less than about 0.2, a simple calculation shows that there would not be enough evidence to 
reject the hypothesis that the correlation is zero. 

The estimator obtained using the lasso on this data yields a S with 36% of its entries set 
to zero. Moreover, 50 of its 2641 columns (or 2%) are everywhere zero except in the diagonal 
position. This means that 36% of asset pairings are marginally uncorrelated. Investigating 
pairwise correlation between assets, conditional on all of the others, involves looking for 
zeros in E , of which we find 140 (or 6%). This means that the rows/columns of E can be 
reordered so that the matrix has block-diagonal structure, and that the returns of 6% of the 
assets are conditionally independent. Figure [5] shows histograms summarizing the number 
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of zeros in each column of X and X . Every column in both matrices had at least one zero 
entry. The figure clearly illustrates that the resulting correlations can be used to cluster the 
assets, but this is beyond the scope of this paper. 

To wrap up the experiment we downloaded the market returns available from the Russel 
3000 index for 1479 (of 1792) contiguous weeks ending 11/5/2007 and used them to create 
a residual return series for each of the 2461 assets in our study. We then re-ran the lasso 
experiment, above, to discover that 58% of the asset parings are marginally uncorrelated and 
14% are conditionally independent when the market is taken into account. The histograms 
corresponding to this experiment are similar to those for the initial one, in Figure and so 
they are not reproduced here. 



6 Discussion 



We have shown how the methods of 



Stambaughl (119971 ) can be applied for large numbers of 



assets whose histories are (nearly) unconstrained in length. The key insight is in replacing 
OLS regressions with more parsimonious ones that either use derived input directions or 
apply some sort of shrinkage. Whereas Stambaugh demonstrated his methodology on 22 
assets, we have shown how the monomvn algorithm — essentially the same methodology with 
a different regression method — can handle thousands. We argued that even when OLS 
regressions suffice, the more parsimonious ones can offer improvements in both accuracy and 
interpretation. We also argued that it is advantageous to let a model selection method (e.g., 
parsimonious regression) decide which dependencies between factors and returns exist, as 
op posed to assuming a classical factor model structure. 



ISI^ (e.g. 



Schaferl . 



Stambaughl (119971) sho wed that by applying the standard noninformative prior it (9) oc 



19971 . pp. 154) it is possible to turn the MLEs fi and E into moments 



H = fi and S ^ E of a Bayesian posterior (predictive) distribution that, when used in the 
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mean-variance framework, are said to take estimation riskinto account. We note that, due to 
the notation used in that paper, it is a common misconception that these posterior moments 
forecast the ML estimates into the future. Since Stambaugh employs the i.i.d. assumption in 
the same way that we do in Eq.([T]), these are only moments of the posterior for 6 conditioned 
on the available historical data. Therefore, time is irrelevant, so the moments apply to 
the past as well without modification. Finally, to label this approach as "Bayesian" is an 
overstatement. While Stambaugh is correct to note that estimates of the mean vector and 
covariance matrix are all that are needed within the mean-variance framework, what results 
is a point-estimate (vector) of optimal portfolio weights, not (samples from) a Bayesian 
posterior distribution, as would be ideal. The challenge is that while the moments of the 
posterior have a nice closed form, the distribution itself does not. Further challenges limit 
the application of this approach in the "big p small n setting" . In this situation the standard 
noninformative prior leads to an improper posterior. This can be most easily seen in the 
calculation of Stambaugh's V = T, (in our notation) in Eq. (69-71), pp. 302, where the 
resulting diagonal would be negative. 

Stambaugh's Bayesian approach is not the only way forward. It is possible to obtain 
the sampling covariance matrix of fi analytically. Howev er, an analyt i c form for the sam- 



Hastie et al. 



2001 



Sections 7.11 



pling variability of S is not known. The bootstrap (e.g. 
& 8.2) offers a Monte Carlo method for quantifying the stability of S via its component- 
wise confidence intervals. We took a related approach at the end of Section 15.21 to exam- 
ine how variability in X, arising from random subs amples of 250 assets, filt ers through to 



Little and Rubinl ( 2002 



Section 7.4.4) 



the properties of the balanced portfolios. However, 
make a strong argument in preference for a fully Bayesian approach instead. Facilitating 
tractable Bayesian estimation for parsimonious regression algo rithms, as would be req uired 



by monomvn, presents a serious challenge. The 



so-called Bayesian latent factor models (jWest 



. 3ayes ian lasso fjPark and Casella 



2008|) and 



20031 ). which can be seen as a Bayesian ex- 
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tension of principal components and partial least squares regressions, have received much 
attention in the recent literature. Exploring the extent to which these can be applied within 
the monomvn algorithm to get samples from the posterior distribution of fi and ~S is part of 
our ongoing work. These samples can accurately reflect the estimation risk in mean-variance 
portfolio allocation by filtering the uncertainty though the optimization to get a distribution 
on the simplex of portfolio weights. 

Another interesting extension would involve relaxing the assum ption of (m ultivariate) 



normality, i.e., to decou ple the depend ence distribution, or copula (ISklar 



marginals. In this regard, 



19571 ). from the 



Pattonl (120061 ) has made promising inroads into applying copulas to 



a pair o 



(INelsen 



retur n series under a monotone missingness pattern. Although the theory for copulas 



19991 ) naturally extends beyond two dimensions, the application of the methodology 
quickly becomes intractable without enforcing severely restrictive assumptions. Our ongo- 
ing work includes identifying ways in which the monomvn algorithm for high-dimensional 
estimation under monotone missingness may be extended to support marginal Student-t 
distributions and GARCH models with various parametric forms of the copula. While there 
is pl enty of evide nce in the literature against the assumption of normality for asset returns 



e.g. 



Mills 



19271 ). we argued that the most important thing is to be able to make use of all 



of the available data with an algorithm that is computationally tractable. 
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